In-class Ex 5

Author

Hulwana

1 Overview

In this exercise, we will build an explanatory model to discover factors affecting water points status at Osun state of Nigeria. We will focus only on functional and non-functional water points data and exclude unknown water points.

1.1 Load the Packages

For our analysis, we will utilize the following packages:

  1. Data Wrangling:
  • sf - for importing and processing geospatial data,

  • tidyverse - for importing and processing non-spatial data. In this exercise, readr package will be used for importing wkt data and dplyr package will be used to wrangling the data,

  1. Visualisation:
  • tmap - for preparation of cartographic quality choropleth map,

  • funModeling - for rapid Exploratory Data Analysis,

  • ggpubr - for creating and customizing ‘ggplot2’- based publication ready plots,

  • corrplot - visual exploratory tool on correlation matrix that supports automatic variable reordering to help detect hidden patterns among variables,

  • skimr - for us to do EDA

  1. Correlation and Clustering Analysis:
  • spdep - for computation of spatial weights, global and local spatial autocorrelation statistics,

  • blorr - for logistic regression

  • caret - designed for machine learning purposes, in our case we will use it to do our logistic regression

We will run the following code chunk to load the required packages:

pacman::p_load(sf, tidyverse, tmap, spdep, funModeling, ggpubr, corrplot, 
               heatmaply, cluster, blorr, GWmodel, skimr, caret)

1.2 Load the Data

For Osun we will retain ADM2_EN, ADM2_PCODE, ADM1_EN and ADM1_PCODE to retain the LGA and state.

Osun_wp_sf contains 4760 observations and thus sufficient number of data to do our logistic regression.

Field names with ‘#’ has also been cleaned up prior to the load.

Unknown water points also have been excluded.

A new field called ‘status’ has been created from the variable ‘status_clean’ by mutating it either to T or F

when it is classified under functional or non-functional water points respectively.

Osun <- read_rds("rds/Osun.rds")
Osun_wp_sf <- read_rds("rds/Osun_wp_sf.rds")

1.3 EDA

Osun_wp_sf %>% freq(input = 'status')
Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
of ggplot2 3.3.4.
ℹ The deprecated feature was likely used in the funModeling package.
  Please report the issue at <https://github.com/pablo14/funModeling/issues>.

  status frequency percentage cumulative_perc
1   TRUE      2642       55.5            55.5
2  FALSE      2118       44.5           100.0

Observe that this region has a relatively high proportion of non-functional water points at 44.5% which is higher than the average proportion of non-functional water points of 22.5% across regions in Nigeria.

1.3.1 Summary Statistics with skimr

In logisticregression model, if there is at least one missing value that observation will be excluded from the analysis. Additionally, if there are a high proportion of missing values (even if it at 20%), then we should not use that variable even if that variable might be a useful field in our analysis.

Skimr enables us to retain the table into tibble form. We will get 4 output tables.

The first table it summarises the number of rows and columns, and the number of different data types.

The second report provides information on how many missing values, the percentage number of data not missing, the min and max values.

We see that water_point_population and local_population_1km has 4 missing data points.

Osun_wp_sf %>% skim()
Warning: Couldn't find skimmers for class: sfc_POINT, sfc; No user-defined `sfl`
provided. Falling back to `character`.
Data summary
Name Piped data
Number of rows 4760
Number of columns 75
_______________________
Column type frequency:
character 47
logical 5
numeric 23
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
source 0 1.00 5 44 0 2 0
report_date 0 1.00 22 22 0 42 0
status_id 0 1.00 2 7 0 3 0
water_source_clean 0 1.00 8 22 0 3 0
water_source_category 0 1.00 4 6 0 2 0
water_tech_clean 24 0.99 9 23 0 3 0
water_tech_category 24 0.99 9 15 0 2 0
facility_type 0 1.00 8 8 0 1 0
clean_country_name 0 1.00 7 7 0 1 0
clean_adm1 0 1.00 3 5 0 5 0
clean_adm2 0 1.00 3 14 0 35 0
clean_adm3 4760 0.00 NA NA 0 0 0
clean_adm4 4760 0.00 NA NA 0 0 0
installer 4760 0.00 NA NA 0 0 0
management_clean 1573 0.67 5 37 0 7 0
status_clean 0 1.00 9 32 0 7 0
pay 0 1.00 2 39 0 7 0
fecal_coliform_presence 4760 0.00 NA NA 0 0 0
subjective_quality 0 1.00 18 20 0 4 0
activity_id 4757 0.00 36 36 0 3 0
scheme_id 4760 0.00 NA NA 0 0 0
wpdx_id 0 1.00 12 12 0 4760 0
notes 0 1.00 2 96 0 3502 0
orig_lnk 4757 0.00 84 84 0 1 0
photo_lnk 41 0.99 84 84 0 4719 0
country_id 0 1.00 2 2 0 1 0
data_lnk 0 1.00 79 96 0 2 0
water_point_history 0 1.00 142 834 0 4750 0
clean_country_id 0 1.00 3 3 0 1 0
country_name 0 1.00 7 7 0 1 0
water_source 0 1.00 8 30 0 4 0
water_tech 0 1.00 5 37 0 20 0
adm2 0 1.00 3 14 0 33 0
adm3 4760 0.00 NA NA 0 0 0
management 1573 0.67 5 47 0 7 0
adm1 0 1.00 4 5 0 4 0
New Georeferenced Column 0 1.00 16 35 0 4760 0
lat_lon_deg 0 1.00 13 32 0 4760 0
public_data_source 0 1.00 84 102 0 2 0
converted 0 1.00 53 53 0 1 0
created_timestamp 0 1.00 22 22 0 2 0
updated_timestamp 0 1.00 22 22 0 2 0
Geometry 0 1.00 33 37 0 4760 0
ADM2_EN 0 1.00 3 14 0 30 0
ADM2_PCODE 0 1.00 8 8 0 30 0
ADM1_EN 0 1.00 4 4 0 1 0
ADM1_PCODE 0 1.00 5 5 0 1 0

Variable type: logical

skim_variable n_missing complete_rate mean count
rehab_year 4760 0 NaN :
rehabilitator 4760 0 NaN :
is_urban 0 1 0.39 FAL: 2884, TRU: 1876
latest_record 0 1 1.00 TRU: 4760
status 0 1 0.56 TRU: 2642, FAL: 2118

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
row_id 0 1.00 68550.48 10216.94 49601.00 66874.75 68244.50 69562.25 471319.00 ▇▁▁▁▁
lat_deg 0 1.00 7.68 0.22 7.06 7.51 7.71 7.88 8.06 ▁▂▇▇▇
lon_deg 0 1.00 4.54 0.21 4.08 4.36 4.56 4.71 5.06 ▃▆▇▇▂
install_year 1144 0.76 2008.63 6.04 1917.00 2006.00 2010.00 2013.00 2015.00 ▁▁▁▁▇
fecal_coliform_value 4760 0.00 NaN NA NA NA NA NA NA
distance_to_primary_road 0 1.00 5021.53 5648.34 0.01 719.36 2972.78 7314.73 26909.86 ▇▂▁▁▁
distance_to_secondary_road 0 1.00 3750.47 3938.63 0.15 460.90 2554.25 5791.94 19559.48 ▇▃▁▁▁
distance_to_tertiary_road 0 1.00 1259.28 1680.04 0.02 121.25 521.77 1834.42 10966.27 ▇▂▁▁▁
distance_to_city 0 1.00 16663.99 10960.82 53.05 7930.75 15030.41 24255.75 47934.34 ▇▇▆▃▁
distance_to_town 0 1.00 16726.59 12452.65 30.00 6876.92 12204.53 27739.46 44020.64 ▇▅▃▃▂
rehab_priority 2654 0.44 489.33 1658.81 0.00 7.00 91.50 376.25 29697.00 ▇▁▁▁▁
water_point_population 4 1.00 513.58 1458.92 0.00 14.00 119.00 433.25 29697.00 ▇▁▁▁▁
local_population_1km 4 1.00 2727.16 4189.46 0.00 176.00 1032.00 3717.00 36118.00 ▇▁▁▁▁
crucialness_score 798 0.83 0.26 0.28 0.00 0.07 0.15 0.35 1.00 ▇▃▁▁▁
pressure_score 798 0.83 1.46 4.16 0.00 0.12 0.41 1.24 93.69 ▇▁▁▁▁
usage_capacity 0 1.00 560.74 338.46 300.00 300.00 300.00 1000.00 1000.00 ▇▁▁▁▅
days_since_report 0 1.00 2692.69 41.92 1483.00 2688.00 2693.00 2700.00 4645.00 ▁▇▁▁▁
staleness_score 0 1.00 42.80 0.58 23.13 42.70 42.79 42.86 62.66 ▁▁▇▁▁
location_id 0 1.00 235865.49 6657.60 23741.00 230638.75 236199.50 240061.25 267454.00 ▁▁▁▁▇
cluster_size 0 1.00 1.05 0.25 1.00 1.00 1.00 1.00 4.00 ▇▁▁▁▁
lat_deg_original 4760 0.00 NaN NA NA NA NA NA NA
lon_deg_original 4760 0.00 NaN NA NA NA NA NA NA
count 0 1.00 1.00 0.00 1.00 1.00 1.00 1.00 1.00 ▁▁▇▁▁

1.3.2 Remove Missing Values based on certain variables

  • Exclude all the missing values for the relevant fields. We will 4 observation values less due to the missing value in water_point_population and local_population_1km.

  • Recode the usage capacity from numerical to factor/categorical. As the usage capacity consists of discrete values mainly those with usage capacity of 300 and 1000.

Osun_wp_sf_clean <- Osun_wp_sf %>%
  filter_at(vars(status,
                 distance_to_primary_road,
                 distance_to_secondary_road,
                 distance_to_tertiary_road,
                 distance_to_city,
                 distance_to_town,
                 water_point_population,
                 local_population_1km,
                 usage_capacity,
                 is_urban,
                 water_source_clean),
            all_vars(!is.na(.)))  %>%
  mutate(usage_capacity = as.factor(usage_capacity))

1.3.3 Correlation Analysis

When we deal with spatial data and want to plot the correlation matrix, we need to drop the geometric column.

Osun_wp <- Osun_wp_sf_clean %>%
  select(c(7,35:39,42:43,46:47,57)) %>%
  st_set_geometry(NULL)

Plotting the correlation matrix.

cluster_vars.cor = cor(Osun_wp[,2:8])

corrplot.mixed(cluster_vars.cor,
               lower = "ellipse",
               upper = "number",
               tl.pos = "lt",
               diag = "l",
               tl.col = "black")

model <- glm(status ~ distance_to_primary_road + 
               distance_to_secondary_road +
               distance_to_tertiary_road +
               distance_to_city +
               distance_to_town +
               is_urban +
               usage_capacity +
               water_source_clean +
               water_point_population +
               local_population_1km,
             data = Osun_wp_sf_clean,
             family = binomial(link = 'logit'))

To look at the report, we use the blr_regress() function from blorr package to convert

blr_regress(model)
                             Model Overview                              
------------------------------------------------------------------------
Data Set    Resp Var    Obs.    Df. Model    Df. Residual    Convergence 
------------------------------------------------------------------------
  data       status     4756      4755           4744           TRUE     
------------------------------------------------------------------------

                    Response Summary                     
--------------------------------------------------------
Outcome        Frequency        Outcome        Frequency 
--------------------------------------------------------
   0             2114              1             2642    
--------------------------------------------------------

                                 Maximum Likelihood Estimates                                   
-----------------------------------------------------------------------------------------------
               Parameter                    DF    Estimate    Std. Error    z value     Pr(>|z|) 
-----------------------------------------------------------------------------------------------
              (Intercept)                   1      0.3887        0.1124      3.4588       5e-04 
        distance_to_primary_road            1      0.0000        0.0000     -0.7153      0.4744 
       distance_to_secondary_road           1      0.0000        0.0000     -0.5530      0.5802 
       distance_to_tertiary_road            1      1e-04         0.0000      4.6708      0.0000 
            distance_to_city                1      0.0000        0.0000     -4.7574      0.0000 
            distance_to_town                1      0.0000        0.0000     -4.9170      0.0000 
              is_urbanTRUE                  1     -0.2971        0.0819     -3.6294       3e-04 
           usage_capacity1000               1     -0.6230        0.0697     -8.9366      0.0000 
water_source_cleanProtected Shallow Well    1      0.5040        0.0857      5.8783      0.0000 
   water_source_cleanProtected Spring       1      1.2882        0.4388      2.9359      0.0033 
         water_point_population             1      -5e-04        0.0000    -11.3686      0.0000 
          local_population_1km              1      3e-04         0.0000     19.2953      0.0000 
-----------------------------------------------------------------------------------------------

 Association of Predicted Probabilities and Observed Responses  
---------------------------------------------------------------
% Concordant          0.7347          Somers' D        0.4693   
% Discordant          0.2653          Gamma            0.4693   
% Tied                0.0000          Tau-a            0.2318   
Pairs                5585188          c                0.7347   
---------------------------------------------------------------

We observe that the variables distance_to_primary_road and distance_to_secondary_road has a p-value greater than a significance level of 0.05 and thus should be excluded since they are not statistically significant.

For categorical variables, a positive value implies an above average correlation and a negative value implies a below average correlation.

We need a separate call, to assess how the overall model work.

We assign a cutoff value of 0.5.

Results:

Overall accuracy is 0.6739

We see that the Sensitivity is higher than Specificity as there are more true positive than true negative.

blr_confusion_matrix(model, cutoff = 0.5)
Confusion Matrix and Statistics 

          Reference
Prediction FALSE TRUE
         0  1301  738
         1   813 1904

                Accuracy : 0.6739 
     No Information Rate : 0.4445 

                   Kappa : 0.3373 

McNemars's Test P-Value  : 0.0602 

             Sensitivity : 0.7207 
             Specificity : 0.6154 
          Pos Pred Value : 0.7008 
          Neg Pred Value : 0.6381 
              Prevalence : 0.5555 
          Detection Rate : 0.4003 
    Detection Prevalence : 0.5713 
       Balanced Accuracy : 0.6680 
               Precision : 0.7008 
                  Recall : 0.7207 

        'Positive' Class : 1

Can we improve the model by incorporating geospatial factors?

2 Geographically weighted regression

2.1 SP dataframe

2.1.1 Select the required variables and convert it into SP dataframe

We will first need to a spatial point dataframe. We used the cleaned version that removes the 4 records with missing value as when we callibrate the missing value would disappear but the geometric polygons will retian and this will result in error.

Osun_wp_sp <- Osun_wp_sf_clean %>%
  select(c(status,
           distance_to_primary_road, 
           distance_to_secondary_road,
           distance_to_tertiary_road,
           distance_to_city,
           distance_to_town,
           water_point_population,
           local_population_1km,
           is_urban,
           usage_capacity,
           water_source_clean)) %>%
  as_Spatial()

Osun_wp_sp
class       : SpatialPointsDataFrame 
features    : 4756 
extent      : 182502.4, 290751, 340054.1, 450905.3  (xmin, xmax, ymin, ymax)
crs         : +proj=tmerc +lat_0=4 +lon_0=8.5 +k=0.99975 +x_0=670553.98 +y_0=0 +a=6378249.145 +rf=293.465 +towgs84=-92,-93,122,0,0,0,0 +units=m +no_defs 
variables   : 11
names       : status, distance_to_primary_road, distance_to_secondary_road, distance_to_tertiary_road, distance_to_city, distance_to_town, water_point_population, local_population_1km, is_urban, usage_capacity, water_source_clean 
min values  :      0,        0.014461356813335,          0.152195902540837,         0.017815121653488, 53.0461399623541, 30.0019777713073,                      0,                    0,        0,           1000,           Borehole 
max values  :      1,         26909.8616132094,           19559.4793799085,          10966.2705628969,  47934.343603562, 44020.6393368124,                  29697,                36118,        1,            300,   Protected Spring 

2.2 Fixed distance matrix

We set longlat as FALSE as we already have converted it into projected coordinated system and obtained the optimal bandwidth using the following code:

bw.fixed <- bw.ggwr(status ~ distance_to_primary_road + 
               distance_to_secondary_road +
               distance_to_tertiary_road +
               distance_to_city +
               distance_to_town +
               water_point_population +
               local_population_1km +
               is_urban +
               usage_capacity +
               water_source_clean,
               data = Osun_wp_sp,
               family = "binomial",
               approach = "AIC",
               kernel = "gaussian",
               adaptive = FALSE,
               longlat = FALSE)
bw.fixed

2.3 Geographically Weighted Logistic Regression Model

2.3.1 Compute the best model

gwlr.fixed <- ggwr.basic(status ~
               distance_to_primary_road + 
               distance_to_secondary_road +
               distance_to_tertiary_road +
               distance_to_city +
               distance_to_town +
               water_point_population +
               local_population_1km +
               is_urban +
               usage_capacity +
               water_source_clean,
               data = Osun_wp_sp,
               #bw = bw.fixed,
               bw = 2599.672,
               family = "binomial",
               kernel = "gaussian",
               adaptive = FALSE,
               longlat = FALSE)
Warning in proj4string(data): CRS object has comment, which is lost in output; in tests, see
https://cran.r-project.org/web/packages/sp/vignettes/CRS_warnings.html
Warning in proj4string(regression.points): CRS object has comment, which is lost in output; in tests, see
https://cran.r-project.org/web/packages/sp/vignettes/CRS_warnings.html
 Iteration    Log-Likelihood
=========================
       0        -1958 
       1        -1676 
       2        -1526 
       3        -1443 
       4        -1405 
       5        -1405 

For the Geographically Weighted Regression model, we only have AIC so we will compare based on AIC and see that there is a reduction in AIC value.

gwlr.fixed
   ***********************************************************************
   *                       Package   GWmodel                             *
   ***********************************************************************
   Program starts at: 2022-12-17 14:43:07 
   Call:
   ggwr.basic(formula = status ~ distance_to_primary_road + distance_to_secondary_road + 
    distance_to_tertiary_road + distance_to_city + distance_to_town + 
    water_point_population + local_population_1km + is_urban + 
    usage_capacity + water_source_clean, data = Osun_wp_sp, bw = 2599.672, 
    family = "binomial", kernel = "gaussian", adaptive = FALSE, 
    longlat = FALSE)

   Dependent (y) variable:  status
   Independent variables:  distance_to_primary_road distance_to_secondary_road distance_to_tertiary_road distance_to_city distance_to_town water_point_population local_population_1km is_urban usage_capacity water_source_clean
   Number of data points: 4756
   Used family: binomial
   ***********************************************************************
   *              Results of Generalized linear Regression               *
   ***********************************************************************

Call:
NULL

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-124.555    -1.755     1.072     1.742    34.333  

Coefficients:
                                           Estimate Std. Error z value Pr(>|z|)
Intercept                                 3.887e-01  1.124e-01   3.459 0.000543
distance_to_primary_road                 -4.642e-06  6.490e-06  -0.715 0.474422
distance_to_secondary_road               -5.143e-06  9.299e-06  -0.553 0.580230
distance_to_tertiary_road                 9.683e-05  2.073e-05   4.671 3.00e-06
distance_to_city                         -1.686e-05  3.544e-06  -4.757 1.96e-06
distance_to_town                         -1.480e-05  3.009e-06  -4.917 8.79e-07
water_point_population                   -5.097e-04  4.484e-05 -11.369  < 2e-16
local_population_1km                      3.451e-04  1.788e-05  19.295  < 2e-16
is_urbanTRUE                             -2.971e-01  8.185e-02  -3.629 0.000284
usage_capacity1000                       -6.230e-01  6.972e-02  -8.937  < 2e-16
water_source_cleanProtected Shallow Well  5.040e-01  8.574e-02   5.878 4.14e-09
water_source_cleanProtected Spring        1.288e+00  4.388e-01   2.936 0.003325
                                            
Intercept                                ***
distance_to_primary_road                    
distance_to_secondary_road                  
distance_to_tertiary_road                ***
distance_to_city                         ***
distance_to_town                         ***
water_point_population                   ***
local_population_1km                     ***
is_urbanTRUE                             ***
usage_capacity1000                       ***
water_source_cleanProtected Shallow Well ***
water_source_cleanProtected Spring       ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6534.5  on 4755  degrees of freedom
Residual deviance: 5688.0  on 4744  degrees of freedom
AIC: 5712

Number of Fisher Scoring iterations: 5


 AICc:  5712.099
 Pseudo R-square value:  0.1295351
   ***********************************************************************
   *          Results of Geographically Weighted Regression              *
   ***********************************************************************

   *********************Model calibration information*********************
   Kernel function: gaussian 
   Fixed bandwidth: 2599.672 
   Regression points: the same locations as observations are used.
   Distance metric: A distance matrix is specified for this model calibration.

   ************Summary of Generalized GWR coefficient estimates:**********
                                                   Min.     1st Qu.      Median
   Intercept                                -8.7229e+02 -4.9955e+00  1.7600e+00
   distance_to_primary_road                 -1.9389e-02 -4.8031e-04  2.9618e-05
   distance_to_secondary_road               -1.5921e-02 -3.7551e-04  1.2317e-04
   distance_to_tertiary_road                -1.5618e-02 -4.2368e-04  7.6179e-05
   distance_to_city                         -1.8416e-02 -5.6217e-04 -1.2726e-04
   distance_to_town                         -2.2411e-02 -5.7283e-04 -1.5155e-04
   water_point_population                   -5.2208e-02 -2.2767e-03 -9.8875e-04
   local_population_1km                     -1.2698e-01  4.9952e-04  1.0638e-03
   is_urbanTRUE                             -1.9790e+02 -4.2908e+00 -1.6864e+00
   usage_capacity1000                       -2.0772e+01 -9.7231e-01 -4.1592e-01
   water_source_cleanProtected.Shallow.Well -2.0789e+01 -4.5190e-01  5.3340e-01
   water_source_cleanProtected.Spring       -5.2235e+02 -5.5977e+00  2.5441e+00
                                                3rd Qu.      Max.
   Intercept                                 1.2763e+01 1073.2156
   distance_to_primary_road                  4.8443e-04    0.0142
   distance_to_secondary_road                6.0692e-04    0.0258
   distance_to_tertiary_road                 6.6815e-04    0.0128
   distance_to_city                          2.3718e-04    0.0150
   distance_to_town                          1.9271e-04    0.0224
   water_point_population                    5.0102e-04    0.1309
   local_population_1km                      1.8157e-03    0.0392
   is_urbanTRUE                              1.2841e+00  744.3099
   usage_capacity1000                        3.0322e-01    5.9281
   water_source_cleanProtected.Shallow.Well  1.7849e+00   67.6343
   water_source_cleanProtected.Spring        6.7663e+00  317.4133
   ************************Diagnostic information*************************
   Number of data points: 4756 
   GW Deviance: 2795.084 
   AIC : 4414.606 
   AICc : 4747.423 
   Pseudo R-square value:  0.5722559 

   ***********************************************************************
   Program stops at: 2022-12-17 14:43:56 

We extract the relevant results and store it into a dataframe.

gwr.fixed <- as.data.frame(gwlr.fixed$SDF)

Next , we will label yhat values greater or equal to 0.5 into 1 and else 0. The result of the logic comparison operation will be saved into a field called most.

gwr.fixed <- gwr.fixed %>%
  mutate(most = ifelse(
    gwr.fixed$yhat  >= 0.5, T, F))
gwr.fixed$y <- as.factor(gwr.fixed$y)
gwr.fixed$most <- as.factor(gwr.fixed$most)

CM <- confusionMatrix(data = gwr.fixed$most, reference = gwr.fixed$y)
CM
Confusion Matrix and Statistics

          Reference
Prediction FALSE TRUE
     FALSE  1824  263
     TRUE    290 2379
                                          
               Accuracy : 0.8837          
                 95% CI : (0.8743, 0.8927)
    No Information Rate : 0.5555          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.7642          
                                          
 Mcnemar's Test P-Value : 0.2689          
                                          
            Sensitivity : 0.8628          
            Specificity : 0.9005          
         Pos Pred Value : 0.8740          
         Neg Pred Value : 0.8913          
             Prevalence : 0.4445          
         Detection Rate : 0.3835          
   Detection Prevalence : 0.4388          
      Balanced Accuracy : 0.8816          
                                          
       'Positive' Class : FALSE           
                                          

2.3.2 Visualising the Result

Osun_wp_sf_selected <- Osun_wp_sf_clean %>%
  select(c(ADM2_EN, ADM2_PCODE,
           ADM1_EN, ADM1_PCODE,
           status))
gwr_sf.fixed <- cbind(Osun_wp_sf_selected, gwr.fixed)
tmap_mode("view")
tmap mode set to interactive viewing
prob_T <- tm_shape(Osun) +
  tm_polygons(alpha = 0.1) +
tm_shape(gwr_sf.fixed) +
  tm_dots(col = "yhat",
          border.col = "gray60",
          border.lwd = 1) +
  tm_view(set.zoom.limits = c(9,14))

prob_T

We obtained a better accuracy from the local model and thus when implementing strategies it should be strategised at a local level rather than a global level.

The model can be further improved by calibrating the model with the 2 insignificant variables excluded.

2.4 Improved Model

2.4.1 Compute the best model

We will exclude the 2 variables that are statistically insignificant (with a p-value of greater than 0.05).

gwlr.fixed.v2 <- ggwr.basic(status ~
               distance_to_tertiary_road +
               distance_to_city +
               distance_to_town +
               water_point_population +
               local_population_1km +
               is_urban +
               usage_capacity +
               water_source_clean,
               data = Osun_wp_sp,
               #bw = bw.fixed,
               bw = 2599.672,
               family = "binomial",
               kernel = "gaussian",
               adaptive = FALSE,
               longlat = FALSE)
Warning in proj4string(data): CRS object has comment, which is lost in output; in tests, see
https://cran.r-project.org/web/packages/sp/vignettes/CRS_warnings.html
Warning in proj4string(regression.points): CRS object has comment, which is lost in output; in tests, see
https://cran.r-project.org/web/packages/sp/vignettes/CRS_warnings.html
 Iteration    Log-Likelihood
=========================
       0        -2035 
       1        -1773 
       2        -1636 
       3        -1562 
       4        -1531 
       5        -1531 

Extract the model results and save it into a dataframe

gwr.fixed.v2 <- as.data.frame(gwlr.fixed.v2$SDF)

Next , we will label yhat values greater or equal to 0.5 into 1 and else 0. The result of the logic comparison operation will be saved into a field called most.

gwr.fixed.v2 <- gwr.fixed.v2 %>%
  mutate(most = ifelse(
    gwr.fixed.v2$yhat  >= 0.5, T, F))
gwr.fixed.v2$y <- as.factor(gwr.fixed.v2$y)
gwr.fixed.v2$most <- as.factor(gwr.fixed.v2$most)

CM <- confusionMatrix(data = gwr.fixed.v2$most, reference = gwr.fixed.v2$y)
CM
Confusion Matrix and Statistics

          Reference
Prediction FALSE TRUE
     FALSE  1792  302
     TRUE    322 2340
                                          
               Accuracy : 0.8688          
                 95% CI : (0.8589, 0.8783)
    No Information Rate : 0.5555          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.7341          
                                          
 Mcnemar's Test P-Value : 0.4469          
                                          
            Sensitivity : 0.8477          
            Specificity : 0.8857          
         Pos Pred Value : 0.8558          
         Neg Pred Value : 0.8790          
             Prevalence : 0.4445          
         Detection Rate : 0.3768          
   Detection Prevalence : 0.4403          
      Balanced Accuracy : 0.8667          
                                          
       'Positive' Class : FALSE           
                                          

Comparing the Accuracy rate without the 2 insignificant variables (distance_to_primary_road and distance_to_secondary_road), the value drops minimally from 0.8837 to 0.8688. We see that in terms of sensitivity and specificity, the value also drops minimally from 0.8628 to 0.8477 and 0.9005 to 0.8857 respectively.

Osun_wp_sf_selected <- Osun_wp_sf_clean %>%
  select(c(ADM2_EN, ADM2_PCODE,
           ADM1_EN, ADM1_PCODE,
           status))
gwr_sf.fixed.v2 <- cbind(Osun_wp_sf_selected, gwr.fixed.v2)

2.4.2 Visualising the Results

tmap_mode("view")
tmap mode set to interactive viewing
prob_T2 <- tm_shape(Osun) +
  tm_polygons(alpha = 0.1) +
tm_shape(gwr_sf.fixed.v2) +
  tm_dots(col = "yhat",
          border.col = "gray60",
          border.lwd = 1) +
  tm_view(set.zoom.limits = c(9,14))

prob_T2
tmap_mode("plot")
tmap mode set to plotting